1.  INTRODUCTION  AND  SUMMARY

Let $X_1, X_2, \ldots, X_n$ be a sample from $F$. Let $\bar{X} = \sum_i X_i / n$ and $S^2 = \sum_i (X_i - \bar{X})^2$. $S^2$ is a very commonly encountered statistic, but its exact distribution is generally intractable except in a few cases such as a normal parent population or a mixture of normal populations. If $F$ is a mixture of two normal populations differing only in means, then Hyrenius [3] gives the exact distribution of $S^2$ as a binomial mixture of noncentral chisquare distributions. On the other hand, if $F$ is a mixture of two normal distributions with common mean but different variances, then $S^2$ can be shown (see Appendix) to be distributed according to a binomial mixture of quadratic form distributions. The distribution of $S^2$ is otherwise unavailable, but a number of approximations for it are known. Prominent among these are the scaled chisquare approximation due to Box [2] and the Laguerre polynomial series approximation by Roy and Tiku [8], which are as follows.

The Box Approximation. Box, in 1953, suggested approximating the distribution of $Y = S^2/\sigma^2$, $\sigma^2 = \mathrm{Var}(X)$, by a scaled chisquare variate in which the parameters are obtained by using the first two moments. Specifically,

$$\Pr(Y \le t) \approx \frac{1}{\Gamma(b)\,\rho^{b}} \int_0^t y^{b-1} e^{-y/\rho}\, dy, \tag{1.1}$$

where $\rho = \mathrm{Var}(Y)/m$, $b = m/\rho$, and $m = E(Y) = n-1$.
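The moment matching behind (1.1) can be sketched in a few lines. The following is a minimal illustration, not the code used in the study: the function names `regularized_lower_gamma` and `box_cdf` are hypothetical, and the incomplete gamma integral is evaluated with a plain power series, which is adequate for moderate arguments only.

```python
import math


def regularized_lower_gamma(a: float, x: float, tol: float = 1e-13) -> float:
    """P(a, x) = (1 / Gamma(a)) * integral_0^x u^(a-1) e^(-u) du, by power series."""
    if x <= 0.0:
        return 0.0
    # Series: P(a, x) = x^a e^(-x) / Gamma(a) * sum_k x^k / (a (a+1) ... (a+k))
    term = 1.0 / a
    total = term
    k = 0
    while abs(term) > tol * abs(total):
        k += 1
        term *= x / (a + k)
        total += term
    return min(1.0, total * math.exp(-x + a * math.log(x) - math.lgamma(a)))


def box_cdf(t: float, mean_y: float, var_y: float) -> float:
    """Scaled chisquare (gamma) approximation to Pr(Y <= t) as in (1.1)."""
    rho = var_y / mean_y          # scale, rho = Var(Y) / m
    b = mean_y / rho              # shape, b = m / rho
    return regularized_lower_gamma(b, t / rho)


if __name__ == "__main__":
    # Normal parent, n = 11: E(Y) = 10 and Var(Y) = 20, so (1.1) is exact.
    print(box_cdf(10.0, 10.0, 20.0))
```

For a normal parent the fitted parameters are $\rho = 2$ and $b = (n-1)/2$, so the approximation reduces to the exact chisquare distribution with $n-1$ degrees of freedom.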


The Roy and Tiku Approximation. Roy and Tiku, in 1962, suggested use of Laguerre polynomials to derive a series approximation for the distribution of $Y = S^2/(2\sigma^2)$. They proposed

$$\Pr(Y \le t) \approx \int_0^t p_m(y) \sum_{j=0}^{k} a_j L_j^{(m)}(y)\, dy, \tag{1.2}$$

where

$$p_m(y) = \frac{1}{\Gamma(m)}\, y^{m-1} e^{-y}, \quad y \ge 0, \qquad L_j^{(m)}(y) = \frac{1}{j!} \sum_{i=0}^{j} \binom{j}{i} (-y)^i\, \frac{\Gamma(m+j)}{\Gamma(m+i)} \tag{1.3}$$

is a Laguerre polynomial of degree $j$, $j \ge 0$, $m = E(Y)$, $k$ = number of terms in the approximation, and $a_j$ are constants determined by using the first $j$ moments. Actually,

$$a_j = \Gamma(m) \sum_{i=0}^{j} \binom{j}{i} \frac{(-1)^i E(Y^i)}{\Gamma(m+i)}. \tag{1.4}$$


Tan and Wong [11] show that the Roy and Tiku approximation can yield very unreasonable results in the case of a very nonnormal parent population such as the exponential, the double exponential, or the product normal distribution. They also examine in some detail the two approximations, and an alternative series approximation introduced by them, when $F$ is a mixture of two normal distributions with a common variance and different means. They find that the Roy and Tiku approximation and their alternative series approximation are superior to the Box approximation. It may be noted that neither the Roy-Tiku nor the Tan-Wong series approximation is very convenient for approximating percentiles.

In this paper the approach of E. Wilson and M. Hilferty [12] to approximating a chisquare distribution, which was later extended by Sankaran [9] and by Jensen and Solomon [5] to other cases, is adapted for developing a Gaussian approximation for $S^2$. The new approximation is presented in section 2. In section 3, this approximation is compared with the approximations due to Box [2] and Roy and Tiku [8] over a spectrum of parent populations, namely, various mixtures of normal distributions, the exponential, the double exponential, the uniform, and the product normal populations. The conclusions of the numerical study are summarized in section 4. The Wilson-Hilferty approximation is found to yield a reasonably good and generally superior approximation.

2.  THE  WILSON-HILFERTY  APPROXIMATION 

Given a nonnegative random variable $Y$, the Wilson-Hilferty approach consists in obtaining an almost symmetrically distributed power $Y^h$ of $Y$ and approximating it by a Gaussian random variable. This reasoning may be attributed to Sankaran [9] who, taking a cue from the Wilson-Hilferty approximation for a chisquare distribution, developed an approximation for the noncentral chisquare distribution. It was further abstracted and extended to central and noncentral quadratic form distributions by Jensen and Solomon [5]. It may be summarized as follows.

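Let $\kappa_1, \kappa_2, \ldots$ denote the cumulants of $Y$ and let $\phi_r = \kappa_r/\kappa_1$, $r = 2, 3, \ldots$ be bounded. Then by using the Taylor expansion we get,

$$\mu_1'(h) = E\left[(Y/\kappa_1)^h\right] = 1 + \frac{h(h-1)\phi_2}{2\kappa_1} + \frac{h(h-1)(h-2)\left[4\phi_3 + 3(h-3)\phi_2^2\right]}{24\kappa_1^2} + O(\kappa_1^{-3}). \tag{2.1}$$

From this the moment $\mu_r'(h) = E\left[(Y/\kappa_1)^h\right]^r$ is obtained by substituting $rh$ for $h$. Simple computations then yield the following series expressions for the central moments $\mu_r(h)$ of $(Y/\kappa_1)^h$ in terms of the powers of $\kappa_1^{-1}$:

$$\mu_2(h) = \frac{h^2\phi_2}{\kappa_1} + \frac{h^2(h-1)}{2\kappa_1^2}\left[2\phi_3 + (3h-5)\phi_2^2\right] + O(\kappa_1^{-3}), \tag{2.2}$$

$$\mu_3(h) = \frac{h^3}{\kappa_1^2}\left[\phi_3 + 3(h-1)\phi_2^2\right] + O(\kappa_1^{-3}), \tag{2.3}$$

$$\mu_4(h) = \frac{3h^4\phi_2^2}{\kappa_1^2} + O(\kappa_1^{-3}). \tag{2.4}$$

The exponent $h = h_0$ which approximately symmetrizes $(Y/\kappa_1)^h$, obtained by equating the leading term of $\mu_3(h)$ to zero, is therefore

$$h_0 = 1 - \frac{\kappa_1\kappa_3}{3\kappa_2^2}. \tag{2.5}$$

$(Y/\kappa_1)^{h_0}$ may now be approximated by the normal distribution with mean $\mu(h_0) = \mu_1'(h_0)$ and variance $\sigma^2(h_0) = \mu_2(h_0)$ given by (2.1) and (2.2) respectively.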
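As a check on the scheme, it may help to work through the chisquare case, where the classical Wilson-Hilferty cube root should be recovered. The block below is a worked illustration under that assumption only; it uses the standard chisquare cumulants $\kappa_r = 2^{r-1}(r-1)!\,\nu$ and the mean and variance expansions referred to as (2.1) and (2.2).

```latex
% Worked example: Y ~ chi-square with \nu degrees of freedom.
\begin{align*}
\kappa_1 &= \nu, \quad \kappa_2 = 2\nu, \quad \kappa_3 = 8\nu
  \quad\Longrightarrow\quad \phi_2 = 2, \quad \phi_3 = 8, \\
h_0 &= 1 - \frac{\kappa_1 \kappa_3}{3 \kappa_2^2}
     = 1 - \frac{8\nu^2}{12\nu^2} = \frac{1}{3}, \\
\mu(h_0) &= 1 + \frac{h_0 (h_0 - 1)\phi_2}{2\kappa_1} + O(\kappa_1^{-3})
          = 1 - \frac{2}{9\nu} + O(\nu^{-3}), \\
\sigma^2(h_0) &= \frac{h_0^2 \phi_2}{\kappa_1} + O(\kappa_1^{-3})
               = \frac{2}{9\nu} + O(\nu^{-3}).
\end{align*}
% The \kappa_1^{-2} terms vanish at h_0 = 1/3 because
%   4\phi_3 + 3(h_0 - 3)\phi_2^2 = 32 - 32 = 0   (mean),
%   2\phi_3 + (3h_0 - 5)\phi_2^2 = 16 - 16 = 0   (variance).
```

Thus $(Y/\nu)^{1/3}$ is approximately normal with mean $1 - 2/(9\nu)$ and variance $2/(9\nu)$, which is the familiar Wilson-Hilferty approximation for a chisquare variable.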


Now let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a population $F$ with finite cumulants $C_1, C_2, \ldots$. Then it is well known (Kendall and Stuart [6], page 290) that the cumulants $\kappa_r$, $r = 1, 2, 3$, of $Y = S^2/\sigma^2$ ($\sigma^2 = C_2$) are

$$\begin{aligned}
\kappa_1 &= (n-1), \\
\kappa_2 &= (n-1)^2 \left[ \frac{C_4}{n\sigma^4} + \frac{2}{n-1} \right], \\
\kappa_3 &= (n-1)^3 \left[ \frac{C_6}{n^2} + \frac{12\,C_4 C_2}{n(n-1)} + \frac{4(n-2)\,C_3^2}{n(n-1)^2} + \frac{8\,C_2^3}{(n-1)^2} \right] \Big/ \sigma^6.
\end{aligned} \tag{2.6}$$

It is easy to see that in this case $\phi_r = \kappa_r/\kappa_1$ are bounded and the Wilson-Hilferty approach is applicable. The exponent $h_0$ is then obtained by (2.5), and $\mu(h_0)$ and $\sigma^2(h_0) = \mu_2(h_0)$ as described in (2.1) and (2.2) respectively. The resulting approximation to the distribution function of $S^2$ is then given by

$$\Pr(S^2 \le t) \approx \Phi\left[ \left\{ (t/\kappa_1)^{h_0} - \mu(h_0) \right\} / \sigma(h_0) \right], \tag{2.7}$$

where $t$ is expressed in units of $\sigma^2$ when $\kappa_1 = n-1$ is taken from (2.6). The corresponding approximation to the $\alpha$th percentile of $S^2$ is

$$S^2_\alpha \approx \kappa_1 \left[ Z_\alpha\, \sigma(h_0) + \mu(h_0) \right]^{1/h_0}, \tag{2.8}$$

where $Z_\alpha$ is the $\alpha$th percentile of the standard normal distribution.
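The recipe of this section, from population cumulants to an approximate c.d.f. and percentile, can be sketched as follows. This is an illustrative sketch rather than the program used for the tables: the function names are hypothetical, the mean and variance use the two-term expansions obtained from the Taylor argument above, and it assumes $0 < h_0$ and a positive variance, which need not hold for every parent population.

```python
import math
from statistics import NormalDist


def sum_sq_cumulants(n, c2, c3, c4, c6):
    """First three cumulants of Y = S^2 / sigma^2 as in (2.6); c_r are population cumulants."""
    k1 = n - 1
    k2 = (n - 1) ** 2 * (c4 / (n * c2 ** 2) + 2.0 / (n - 1))
    k3 = (n - 1) ** 3 * (
        c6 / n ** 2
        + 12.0 * c4 * c2 / (n * (n - 1))
        + 4.0 * (n - 2) * c3 ** 2 / (n * (n - 1) ** 2)
        + 8.0 * c2 ** 3 / (n - 1) ** 2
    ) / c2 ** 3
    return k1, k2, k3


def wh_parameters(k1, k2, k3):
    """Exponent h0 from (2.5) and the mean and s.d. of (Y / k1)^h0."""
    phi2, phi3 = k2 / k1, k3 / k1
    h0 = 1.0 - k1 * k3 / (3.0 * k2 ** 2)
    mean = (1.0 + h0 * (h0 - 1) * phi2 / (2 * k1)
            + h0 * (h0 - 1) * (h0 - 2) * (4 * phi3 + 3 * (h0 - 3) * phi2 ** 2) / (24 * k1 ** 2))
    var = (h0 ** 2 * phi2 / k1
           + h0 ** 2 * (h0 - 1) * (2 * phi3 + (3 * h0 - 5) * phi2 ** 2) / (2 * k1 ** 2))
    return h0, mean, math.sqrt(var)   # assumes var > 0


def wh_cdf(t, k1, k2, k3):
    """Approximate Pr(Y <= t) as in (2.7); t is in units of sigma^2, h0 > 0 assumed."""
    h0, mean, sd = wh_parameters(k1, k2, k3)
    return NormalDist().cdf(((t / k1) ** h0 - mean) / sd)


def wh_percentile(alpha, k1, k2, k3):
    """Approximate alpha-th percentile of Y as in (2.8); needs a positive bracket."""
    h0, mean, sd = wh_parameters(k1, k2, k3)
    return k1 * (NormalDist().inv_cdf(alpha) * sd + mean) ** (1.0 / h0)


if __name__ == "__main__":
    # Normal parent with sigma^2 = 1 and n = 11: Y is chisquare with 10 d.f.
    k = sum_sq_cumulants(11, 1.0, 0.0, 0.0, 0.0)
    print(wh_parameters(*k), wh_percentile(0.95, *k))
```

For a normal parent all population cumulants beyond the second vanish, so the exponent comes out as one third and the percentile can be compared directly with chisquare tables.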


3.  NUMERICAL  COMPARISONS 

This section contains numerical comparisons of the Wilson-Hilferty approximation for the distribution of $S^2$ with the scaled chisquare approximation due to Box [2] and the Laguerre polynomial series approximation due to Roy and Tiku [8]. The comparisons are made by either computing or simulating the true distributions of $S^2$ of samples from various nonnormal populations as described below.

3a.  Mixture  of  Normal  Distributions 
Case 1. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a population with p.d.f.

$$f(x) = pN(\mu_1, \sigma^2) + (1-p)N(\mu_2, \sigma^2), \tag{3.1}$$

where $0 \le p \le 1$, $\sigma^2 > 0$, $-\infty < \mu_1, \mu_2 < \infty$, and $N(\mu, \sigma^2)$ denotes the normal density function with mean $\mu$ and variance $\sigma^2$. Then Hyrenius [3] has shown that

$$\Pr(S^2/\sigma^2 \le t) = \sum_{i=0}^{n} \binom{n}{i} p^i (1-p)^{n-i} \Pr\left( \chi'^2_{n-1}(\lambda_i) \le t \right), \tag{3.2}$$

where $\chi'^2_{n-1}(\lambda_i)$ denotes the noncentral chisquare variable with $n-1$ degrees of freedom and the noncentrality parameter $\lambda_i = i(n-i)(\mu_1 - \mu_2)^2/(n\sigma^2)$. A selection of the values of the exact c.d.f., computed using (3.2) and the IMSL subroutine MDCH, together with the errors of the three approximations computed according to (1.1), (1.2), and (2.7), appear in Table 1.
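The binomial mixture of noncentral chisquare probabilities in (3.2) can be evaluated directly. The sketch below is an assumption-laden stand-in for the IMSL routine named above: it writes the noncentral chisquare c.d.f. as a Poisson mixture of central chisquare c.d.f.'s and evaluates the latter by a power series. The function names and the argument `delta_sq`, standing for $(\mu_1 - \mu_2)^2/\sigma^2$, are hypothetical.

```python
import math


def _reg_lower_gamma(a, x, tol=1e-13):
    """Regularized lower incomplete gamma P(a, x) by power series."""
    if x <= 0.0:
        return 0.0
    term = total = 1.0 / a
    k = 0
    while abs(term) > tol * abs(total):
        k += 1
        term *= x / (a + k)
        total += term
    return min(1.0, total * math.exp(-x + a * math.log(x) - math.lgamma(a)))


def chi2_cdf(x, df):
    """Central chisquare c.d.f."""
    return _reg_lower_gamma(df / 2.0, x / 2.0)


def ncx2_cdf(x, df, nc, tol=1e-12):
    """Noncentral chisquare c.d.f. as a Poisson(nc / 2) mixture of central ones."""
    if nc == 0.0:
        return chi2_cdf(x, df)
    half = nc / 2.0
    log_w = -half                     # log of the Poisson weight at k = 0
    total = weight_sum = 0.0
    k = 0
    while weight_sum < 1.0 - tol and k < 10000:
        w = math.exp(log_w)
        total += w * chi2_cdf(x, df + 2 * k)
        weight_sum += w
        k += 1
        log_w += math.log(half) - math.log(k)
    return total


def mean_mixture_cdf(t, n, p, delta_sq):
    """Pr(S^2 / sigma^2 <= t) from (3.2); delta_sq = (mu1 - mu2)^2 / sigma^2."""
    return sum(
        math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)
        * ncx2_cdf(t, n - 1, i * (n - i) * delta_sq / n)
        for i in range(n + 1)
    )


if __name__ == "__main__":
    print(mean_mixture_cdf(10.0, 11, 0.5, 16.0))
```

When the two means coincide every noncentrality parameter is zero and the mixture collapses to the central chisquare distribution with $n-1$ degrees of freedom, which gives a convenient sanity check.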


Case 2. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a population with p.d.f.

$$f(x) = pN(\mu, \sigma_1^2) + (1-p)N(\mu, \sigma_2^2), \tag{3.3}$$

where $0 \le p \le 1$, $\sigma_1^2 > 0$, $\sigma_2^2 > 0$, $-\infty < \mu < \infty$, and $N(\mu, \sigma^2)$ denotes a normal density function with mean $\mu$ and variance $\sigma^2$. Then it is shown in the Appendix that

$$\Pr(S^2 \le t) = \sum_{i=0}^{n} \binom{n}{i} p^i (1-p)^{n-i} \Pr\Big( \sum_j \lambda_j Y_j \le t \Big), \tag{3.4}$$

where, as described in the Appendix, $\sum_j \lambda_j Y_j$ is a quadratic form in independently distributed normal variables. A selection of the values of the exact c.d.f., computed using (3.4) and the subroutine FQUAD [7] prepared from the technique derived by Imhof [4], and the errors of the three approximations appear in Table 2.


3b.  Other  Nonnormal  Populations 

The other nonnormal populations used for the comparisons are (i) uniform, (ii) exponential, (iii) product normal, and (iv) double exponential. The exact distributions of the sample variances from these populations are not available. Therefore, the c.d.f.'s are estimated from the following Monte Carlo experiments.

Using the generator RANDU, supported by the Digital Equipment Corporation on PDP 11/70 computers, to generate U(0,1) random variables, and transformations such as Box-Muller [1], 5000 random samples of size 20 each from the four populations were obtained. From these samples the empirical c.d.f. of $S^2$ for each population was then constructed. This process was repeated seven times. For each selected value of $S^2$ the average of the seven values of the c.d.f. was used as the value of the Monte Carlo c.d.f. The following is a brief explanation of the method used to generate random samples for each population.

(i) Uniform (0,1): Use of the RANDU subroutine.

(ii) Exponential (1): Obtain $U = U(0,1)$, then $X = -\log(U)$.

(iii) Product normal: $X = Z_1 Z_2$ where $Z_i$, $i = 1, 2$, are i.i.d. N(0,1). Obtain $U_1$ and $U_2$ using RANDU, then compute $X = -\log(U_1) \sin(4\pi U_2)$.

(iv) Double exponential (0,1): Obtain $U = U(0,1)$, then $X = \log(2U)$ if $U < .5$, or $X = -\log[2(1-U)]$ otherwise.

A selection of the values of the empirical c.d.f. of $S^2$ of the samples from the four populations, together with the errors of the three approximations, appear in Table 3.
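The Monte Carlo experiment can be imitated on a small scale as follows. This is only a sketch of the procedure: it uses Python's built-in generator in place of RANDU, the names `draw`, `sum_sq`, and `empirical_cdf` are hypothetical, and the default of 5000 samples of size 20 corresponds to a single one of the seven repetitions described above.

```python
import math
import random


def _open_unit(rng):
    """Uniform draw on the open interval (0, 1), so that log() is always defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def draw(population, rng):
    """One observation from the named population, using the transformations (i)-(iv)."""
    if population == "uniform":
        return rng.random()
    if population == "exponential":
        return -math.log(_open_unit(rng))
    if population == "product_normal":
        # Product of the two Box-Muller normals: -log(U1) * sin(4 * pi * U2)
        return -math.log(_open_unit(rng)) * math.sin(4.0 * math.pi * rng.random())
    if population == "double_exponential":
        u = _open_unit(rng)
        return math.log(2.0 * u) if u < 0.5 else -math.log(2.0 * (1.0 - u))
    raise ValueError(f"unknown population: {population}")


def sum_sq(sample):
    """Corrected sum of squares S^2 = sum (x - mean)^2."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample)


def empirical_cdf(population, t_values, n=20, reps=5000, seed=0):
    """Empirical c.d.f. of S^2 at each t, from `reps` samples of size n."""
    rng = random.Random(seed)
    stats = [sum_sq([draw(population, rng) for _ in range(n)]) for _ in range(reps)]
    return [sum(s <= t for s in stats) / reps for t in t_values]


if __name__ == "__main__":
    print(empirical_cdf("uniform", [1.1, 1.6, 2.2]))
```

Averaging the output over several seeds mimics the seven repetitions; the spread across seeds gives a rough idea of the Monte Carlo error in the tabulated c.d.f. values.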

4.  CONCLUSIONS 

From the numerical studies described in the previous section the following conclusions are drawn. The abbreviations W-H, R-T, and Box denote the Wilson-Hilferty, the Roy and Tiku, and the Box approximations respectively.

1. From Table 1, corresponding to the mixture of two normal distributions differing in means only, the following can be observed. (a) The three approximations are reasonable for small values of $|\mu_1 - \mu_2|$ but their quality deteriorates as the value of $|\mu_1 - \mu_2|$ increases. (b) As the value of $p$ increases W-H improves and Box worsens. (c) W-H is substantially superior to Box and R-T when the value of $|\mu_1 - \mu_2|$ is large; when the value of $|\mu_1 - \mu_2|$ is small it is slightly inferior to R-T. Box is not better than W-H anywhere.

2. From Table 2, corresponding to the mixture of two normal distributions differing in variances only, the following can be observed. (a) All three approximations are reasonable over the range of parameters considered. (b) Box is superior to W-H and R-T when $p$ is small and the ratio of variances is large. (c) R-T is superior to W-H and Box when $p$ as well as the ratio of variances is small. (d) Otherwise W-H and Box are equally good.

3. The observations from Table 3, corresponding to the uniform, the exponential, the product normal, and the double exponential populations, are as follows. (a) R-T is the poorest performing approximation, in general embarrassingly so. Clearly the improper estimates of the probabilities are due to truncation of the series after four terms. (b) W-H is the best of the three approximations. Its performance appears to be substantially superior in all four cases.

4. In summary, it is concluded that the Wilson-Hilferty approximation, derived in section 2, is a reasonable approximation over the spectrum of populations considered. In no case is W-H the poorest of the three, nor is it embarrassingly bettered by either of the other two approximations. When it is superior it is substantially so.


TABLE 1. Exact C.D.F. of $S^2/\sigma^2$ of Samples from $pN(\mu_1, \sigma^2) + (1-p)N(\mu_2, \sigma^2)$ and Errors* of the Approximations, $\delta = 4$ and $N = 11$.

TABLE 3. Monte Carlo C.D.F. of $S^2$ of Samples of Size 20 from Various Populations and Errors of the Approximations.

               Uniform                                  Exponential
    t     (1)     (2)    (3)     (4)         t     (1)     (2)    (3)      (4)
   1.1   .0703     15    -86     101         6    .0496   -284    486     1503
   1.2   .1252     10    -48     206         8    .1179   -397    500     5715
   1.3   .2009     18     37     285        12    .3095   -314    169     5528
   1.5   .4085     31    199     228        14    .4028   -146     32    -4578
   1.6   .5282     13    197      82        18    .5715     81   -196   -18204
   1.7   .6396     36    190     -35        21    .6719    165   -268    -9245
   1.8   .7410     32    127    -154        27    .8134    164   -270   -11618
   2.0   .8896     -1    -23    -253        34    .9039    100   -161     3049
   2.1   .9335      0    -53    -209        42    .9523     61    -35    -2479
   2.2   .9628     -5    -69    -144        50    .9763     25     12     -920

            Product-Normal                           Double Exponential
    t     (1)     (2)    (3)     (4)         t     (1)     (2)    (3)      (4)
    6    .0590   -250    392    1732        15    .0546   -110    240      439
    8    .1308   -320    371    6733        19    .1234   -120    228      954
   10    .2188   -283    270   10824        27    .3144     -8     61     -160
   14    .4062    -91     -2   -4713        31    .4194     30    -56    -1473
   16    .4929     -4   -112  -16790        35    .5192     45   -153    -2145
   21    .6705    104   -254  -11665        39    .6092     39   -219    -1796
   27    .8085    121   -221   12884        45    .7196     26   -242     -120
   34    .9006     67   -128    3893        52    .8125     23   -187     1344
   42    .9498     50    -10   -2599        63    .9043     -6    -92      807
   50    .9755     15     20   -1015        76    .9568    -11     -5     -301

*Each C.D.F. is estimated on the basis of seven sets of 5000 samples.
**Error = (Approximate C.D.F. - Monte Carlo C.D.F.) x 10^4.

(1) Monte Carlo C.D.F. $\Pr(S^2 \le t)$ (see section 3b); (2) Error: Wilson-Hilferty Approximation (2.7); (3) Error: Box Approximation (1.1); (4) Error: Roy-Tiku Approximation (1.2).

APPENDIX 

THE  DISTRIBUTION  OF  SAMPLE  VARIANCE 
FOR  A SCALED  MIXTURE  OF  NORMAL  POPULATIONS 

Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with probability density function (p.d.f.)

$$f(x) = pN(0, 1) + (1-p)N(0, \sigma^2), \tag{A.1}$$

where $0 \le p \le 1$ and $N(\mu, \sigma^2)$ denotes the normal density function with mean $\mu$ and variance $\sigma^2$. The corrected sum of squares may be expressed as a quadratic form in the $X$'s as

$$\sum_i (X_i - \bar{X})^2 = X'AX, \tag{A.2}$$

where $X' = (X_1, X_2, \ldots, X_n)$, $A = (I_n - n^{-1}J_n)$, and $J_n$ is the $n \times n$ matrix of 1's. Using this representation it is easy to compute the characteristic function of $X'AX$ as given in the following proposition.

Proposition: The characteristic function of $X'AX$ is given by

$$\phi(t) = \sum_{r=0}^{n} \binom{n}{r} p^r (1-p)^{n-r}\, |I - 2it\,A\Delta_r|^{-1/2}, \tag{A.3}$$

where $\Delta_r$ is the matrix

$$\Delta_r = \begin{pmatrix} I_r & 0 \\ 0 & \sigma^2 I_{n-r} \end{pmatrix}. \tag{A.4}$$

The p.d.f. of $S^2$ can be obtained by inverting the above characteristic function. This may be done as follows.

Let $A\Delta_r = B_r = B$, a matrix of order $n$ that is similar to the symmetric matrix $\Delta_r^{1/2} A \Delta_r^{1/2}$ and hence diagonalizable with real eigenvalues. Now suppressing the suffix $r$, there exists a nonsingular matrix $T$ such that $T^{-1}BT = \mathrm{diag}(D_1, \ldots, D_k) = D$, where $k$ = number of distinct eigenvalues $\lambda_i$ of $B$ with respective multiplicity $n_i$, $D_i = \lambda_i I_{n_i}$, and $\sum_i n_i = n$. Thus,

$$|I - 2itB| = |T^{-1}|\,|I - 2itB|\,|T| = |I - 2itD| = \prod_{i=1}^{k} (1 - 2it\lambda_i)^{n_i}. \tag{A.5}$$

Applying the inversion theorem to this characteristic function we find that

$$|I - 2it\,A\Delta_r|^{-1/2} = \prod_{i=1}^{k} (1 - 2it\lambda_i)^{-n_i/2} \tag{A.6}$$

is the characteristic function of $Q_r = \sum_i \lambda_i \chi^2_{n_i}$, where the $\chi^2_{n_i}$ are independent $\chi^2$ variables. Hence,